Feat/final synthesis compaction #1273

Open

Aayushjshah wants to merge 3 commits into main from feat/finalSynthesisCompaction
Conversation

Aayushjshah (Collaborator) commented Mar 9, 2026

Description

Testing

Additional Notes

Summary by CodeRabbit

  • New Features

    • Final answer synthesis now supports sectional organization, intelligent image prioritization within token budgets, and adaptive streaming
    • Enhanced model token limit detection with configurable output constraints
  • Improvements

    • Optimized streaming architecture for final answers with improved error handling
    • More robust model metadata normalization, caching, and fallback mechanisms when services are unavailable

coderabbitai bot (Contributor) commented Mar 9, 2026

📝 Walkthrough

This PR introduces a comprehensive Final Answer Synthesis module to orchestrate end-to-end generation of structured user responses with support for single-shot and sectional modes, alongside a significant refactor of model metadata handling with caching, normalization, and token limit resolution.

Changes

  • Final Answer Synthesis Module (server/api/chat/final-answer-synthesis.ts)
    New 1686-line module providing the complete final answer synthesis pipeline with token budgeting, fragment preview generation, section planning/mapping, image selection, streaming support, and both single-shot and sectional synthesis modes. Exports FinalSynthesisExecutionResult and supporting types, with an internal testing surface via __finalAnswerSynthesisInternals.
  • Final Answer Synthesis Tests (server/tests/finalAnswerSynthesis.test.ts)
    New 285-line test suite validating fragment preview generation, sectioned payload construction, mode-switching logic, fragment budget selection, and fallback handling for oversized content.
  • Agent Integration (server/api/chat/message-agents.ts)
    Refactored the final synthesis flow to delegate to the new executeFinalSynthesis and related helpers, removing 312 lines of in-file synthesis logic. Now imports and re-exports buildFinalSynthesisPayload from the final-answer-synthesis module. Simplified the streaming path and result handling.
  • Model Metadata and Caching (server/ai/fetchModels.ts)
    Comprehensive refactor introducing a model info caching layer (+622/-347 lines). Adds normalization utilities, cost parsing, token limit resolution, and a new public API (fetchModelInfoFromAPI, fetchNormalizedModelMetadata, getModelTokenLimits, getEffectiveMaxOutputTokens, getCostConfigForModel, getLiteLLMWorkflowModels). Includes a __modelInfoInternals testing surface with cache management helpers.
  • Model Metadata Tests (server/tests/fetchModels.test.ts)
    New 128-line test suite validating model record normalization, conflict detection, token limit resolution via enum IDs and provider names, and effective output token clamping behavior.
  • Provider and Workflow Updates (server/ai/provider/base.ts, server/api/workflow.ts)
    Updated getModelParams to dynamically resolve token limits using the new getEffectiveMaxOutputTokens. Simplified GetModelEnumsApi to call getLiteLLMWorkflowModels() directly instead of multi-step filtering/transformation (+2/-75 lines).
  • Configuration and Utilities (server/shared/types.ts, server/logger/index.ts)
    Added optional maxInputTokens and maxOutputTokens fields to ModelConfiguration. Removed the unused pino levels import.
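The ModelConfiguration change above can be sketched as follows. This is a minimal illustration, not the repository's actual interface: the surrounding fields and the fallback helper (`resolveMaxInputTokens`) are assumptions, while the field names and the DEFAULT_MAX_INPUT_TOKENS constant come from the summaries in this PR.

```typescript
// Illustrative shape of the configuration change; only maxInputTokens /
// maxOutputTokens are taken from the PR summary, the rest is assumed.
interface ModelConfiguration {
  actualName: string
  // New optional per-model token limits (undefined falls back to a default)
  maxInputTokens?: number
  maxOutputTokens?: number
}

const DEFAULT_MAX_INPUT_TOKENS = 128_000

// Hypothetical lookup: use the configured limit when present,
// otherwise fall back to the default.
function resolveMaxInputTokens(config: ModelConfiguration): number {
  return config.maxInputTokens ?? DEFAULT_MAX_INPUT_TOKENS
}
```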

Sequence Diagram(s)

sequenceDiagram
    participant Agent as Agent Flow
    participant Synth as Final Answer Synthesis
    participant Planner as Section Planner
    participant Mapper as Fragment Mapper
    participant LLM as Language Model
    participant Store as Result/Stream
    
    Agent->>Synth: executeFinalSynthesis(context)
    Synth->>Synth: decideSynthesisMode(budget)
    
    alt Mode: Sectional
        Synth->>Planner: planSections(fragments, budget)
        Planner->>LLM: generate section plan
        LLM-->>Planner: section definitions
        Planner-->>Synth: SectionPlanResult
        
        Synth->>Mapper: mapFragmentsToSections(fragments, plan)
        Mapper->>LLM: assign fragments to sections
        LLM-->>Mapper: mappings
        Mapper-->>Synth: SectionMappingEnvelope
        
        loop Per Section
            Synth->>Synth: selectImagesForSection(budget)
            Synth->>LLM: synthesizeSection(section, fragments, images)
            LLM-->>Synth: section answer
            Synth->>Store: stream section result
        end
    else Mode: Single
        Synth->>Synth: selectImagesForFinalSynthesis(budget)
        Synth->>LLM: synthesizeSingleAnswer(fragments, images)
        LLM-->>Synth: final answer
        Synth->>Store: stream answer chunks
    end
    
    Synth-->>Agent: FinalSynthesisExecutionResult
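The mode decision at the top of the diagram can be sketched as below. `decideSynthesisMode` is named in the walkthrough, but its signature and the budget fields here are assumptions: the idea is simply that when the estimated input exceeds the safe budget, the pipeline switches from single-shot to sectional synthesis.

```typescript
type SynthesisMode = "single" | "sectional"

// Assumed budget shape: how many tokens the full context would need,
// versus how many the model can safely accept after reserving output room.
interface SynthesisBudget {
  estimatedInputTokens: number
  safeInputBudget: number
}

// If everything fits in one prompt, synthesize in a single shot;
// otherwise plan sections and synthesize each independently.
function decideSynthesisMode(budget: SynthesisBudget): SynthesisMode {
  return budget.estimatedInputTokens <= budget.safeInputBudget
    ? "single"
    : "sectional"
}
```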
sequenceDiagram
    participant App as Application
    participant Cache as Model Info Cache
    participant API as External Model API
    participant Norm as Normalization
    participant Store as Normalized Metadata
    
    App->>Cache: fetchNormalizedModelMetadata(forceRefresh)
    
    alt Cache Valid
        Cache-->>App: cached metadata
    else Cache Invalid
        Cache->>API: fetchModelInfoFromAPI()
        API-->>Cache: raw model records
        Cache->>Norm: normalizeModelInfoRecords(raw)
        Norm->>Norm: resolve conflicts, compute limits
        Norm-->>Store: NormalizedModelMetadata
        Store-->>App: normalized metadata
    end
    
    App->>App: getModelTokenLimits(modelId)
    App->>App: getEffectiveMaxOutputTokens(requested)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

The PR spans multiple interconnected systems (final synthesis orchestration, model metadata normalization, provider integration) with significant logic density. The new final-answer-synthesis module introduces intricate budgeting, fallback handling, and streaming coordination across 1700+ lines. The fetchModels refactor introduces a new caching/normalization layer with conflict detection and token resolution logic. Integration points across message-agents, provider/base, and workflow require cross-file reasoning. However, changes are well-scoped and cohesive rather than scattered.

Possibly related PRs

Suggested labels

server, ai, chat, refactor

Suggested reviewers

  • zereraz
  • shivamashtikar
  • kalpadhwaryu
  • junaid-shirur
  • devesh-juspay
  • Himanshvarma

Poem

🐰 A synthesis blooms, fragments unite,
Models now cache their metadata right,
Sections take shape, tokens align,
Budgets respected, results so fine,
From chaos to answers, streaming with grace!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title "Feat/final synthesis compaction" accurately reflects the main change: introducing a comprehensive Final Answer Synthesis module with consolidated logic, centralized helpers, and refactored payload generation across multiple files. It is concise, specific, and clearly communicates the primary objective.


gemini-code-assist bot (Contributor) commented
Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the AI agent's ability to generate final answers, particularly when dealing with extensive evidence. By introducing a flexible synthesis pipeline that can dynamically switch between single and sectional processing modes, the system can now efficiently handle large context windows, improve answer quality, and reduce the risk of context length errors. The changes also centralize and streamline the underlying logic, making the system more scalable and easier to maintain.

Highlights

  • Introduced Sectional Final Answer Synthesis: Implemented a new, robust final answer synthesis pipeline that can operate in either a 'single' mode for smaller contexts or a 'sectional' mode for larger evidence sets. This allows the AI agent to break down complex answers into manageable sections, process them in parallel, and then reassemble them, significantly improving scalability and context handling.
  • Dynamic Model Input Token Configuration: Added dynamic configuration for Large Language Model (LLM) maximum input tokens, allowing the system to adapt its synthesis strategy based on the specific model's context window capabilities. This includes a new utility function to retrieve the maximum input tokens for any given model.
  • Refactored Final Synthesis Logic: Centralized the final answer synthesis logic into a dedicated module (final-answer-synthesis.ts), decoupling it from the main message-agents.ts file. This improves modularity, readability, and maintainability by removing duplicated prompt building and image selection logic.
  • Enhanced Context Management for Synthesis: Developed sophisticated mechanisms for managing context during synthesis, including fragment preview generation, intelligent fragment batching within token budgets, and image selection strategies that prioritize recent and user-attached images while respecting model limits.
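The budget-aware fragment batching mentioned in the last highlight can be sketched as a greedy packer. `buildFragmentBatchesWithinBudget` is named in the changelog below, but the signature, the `Fragment` shape, and the handling of oversized fragments (here they simply get a batch of their own; the real module may truncate instead) are assumptions:

```typescript
interface Fragment {
  id: string
  estimatedTokens: number
}

// Greedy batching sketch: fill a batch until adding the next fragment
// would exceed the per-batch token budget, then start a new batch.
function buildFragmentBatchesWithinBudget(
  fragments: Fragment[],
  budgetPerBatch: number,
): Fragment[][] {
  const batches: Fragment[][] = []
  let current: Fragment[] = []
  let used = 0
  for (const fragment of fragments) {
    if (current.length > 0 && used + fragment.estimatedTokens > budgetPerBatch) {
      batches.push(current)
      current = []
      used = 0
    }
    current.push(fragment)
    used += fragment.estimatedTokens
  }
  if (current.length > 0) batches.push(current)
  return batches
}
```

With a 100-token budget, three fragments of 50, 60, and 70 tokens land in three separate batches; raising the budget to 200 packs all three into one.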


Changelog
  • server/ai/modelConfig.ts
    • Added DEFAULT_MAX_INPUT_TOKENS constant to define a baseline token limit.
    • Introduced MODEL_MAX_INPUT_TOKEN_OVERRIDES to specify custom maximum input token limits for various LLM models.
    • Implemented logic to apply these overrides to the MODEL_CONFIGURATIONS.
    • Exported getModelMaxInputTokens function to retrieve the appropriate maximum input token count for a given model.
  • server/api/chat/final-answer-synthesis.ts
    • Added new file containing the core logic for final answer synthesis.
    • Defined types for FinalSynthesisExecutionResult, SynthesisModeSelection, FinalSection, SectionAnswerResult, FragmentPreviewRecord, FragmentAssignmentBatch, SelectedImagesResult, and SectionMappingEnvelope.
    • Implemented utility functions for text normalization, truncation, and token estimation (estimateTextTokens, estimateImageTokens, estimatePromptTokens).
    • Included error classification for planner fallbacks (classifyPlannerFallbackError).
    • Developed functions for managing stop signals and streaming answer chunks (buildStoppedFinalSynthesisResult, logFinalSynthesisStop, cancelConverseIterator, streamFinalAnswerChunk).
    • Created estimateSafeInputBudget to calculate available token budget for input.
    • Provided functions to format plan and clarifications for prompts (formatPlanForPrompt, formatClarificationsForPrompt).
    • Built shared context for final answer generation (buildSharedFinalAnswerContext).
    • Defined base system prompts for final and sectional answers (buildBaseFinalAnswerSystemPrompt).
    • Implemented buildFinalSynthesisPayload for single-shot synthesis.
    • Added functions for sectional planning and mapping (formatSectionPlanOverview, buildPlannerSystemPrompt, buildMapperSystemPrompt, formatSectionFragments).
    • Developed logic for fragment preview records and budget-aware text inclusion (findTimestamp, buildFragmentPreviewRecord, formatPreviewRecord, buildPreviewOmissionSummary, buildPreviewTextWithinBudget).
    • Implemented buildFragmentBatchesWithinBudget for efficient fragment processing.
    • Included functions for normalizing and merging section plans and assignments (normalizeSectionPlan, normalizeSectionAssignments, mergeSectionAssignments).
    • Created runWithConcurrency for parallel processing of items.
    • Defined buildDefaultSectionPlan and buildDefaultAssignments for fallback scenarios.
    • Implemented image selection logic (createSelectedImagesResult, selectImagesForFinalSynthesis, selectImagesForFragmentIds).
    • Added selectMappedEntriesWithinBudget to trim fragments based on token limits.
    • Developed buildSectionAnswerPayload for individual section synthesis.
    • Implemented decideSynthesisMode to choose between single and sectional synthesis.
    • Provided asynchronous functions for planning sections (planSections) and mapping fragments (mapFragmentsToSections).
    • Implemented the core synthesis functions for single answers (synthesizeSingleAnswer) and individual sections (synthesizeSection).
    • Added assembleSectionAnswers to combine results from sectional synthesis.
    • Implemented synthesizeSectionalAnswer to orchestrate the sectional synthesis process.
    • Exported executeFinalSynthesis as the main entry point for final answer generation.
    • Exposed internal functions for testing via __finalAnswerSynthesisInternals.
  • server/api/chat/message-agents.ts
    • Imported executeFinalSynthesis, formatClarificationsForPrompt, and formatPlanForPrompt from the new final-answer-synthesis module.
    • Removed buildAgentSystemPromptContextBlock, formatFragmentsWithMetadata imports from message-agents-metadata as they are now handled by the new module.
    • Exported buildFinalSynthesisPayload from final-answer-synthesis.
    • Removed the local formatPlanForPrompt function, now using the imported version.
    • Removed the local formatClarificationsForPrompt function, now using the imported version.
    • Removed the local buildFinalSynthesisPayload function, now using the imported version.
    • Removed the local selectImagesForFinalSynthesis function, now using the imported version.
    • Refactored createFinalSynthesisTool to call the new executeFinalSynthesis function, simplifying its internal logic.
    • Updated reasoning emission messages to reflect the new sectional synthesis mode.
  • server/logger/index.ts
    • Removed unused import of levels from the pino library.
  • server/shared/types.ts
    • Added an optional maxInputTokens property to the ModelConfiguration interface.
  • server/tests/finalAnswerSynthesis.test.ts
    • Added new test file for final-answer-synthesis module.
    • Included tests for buildFragmentPreviewRecord to ensure deterministic fragment previews.
    • Tested buildSectionAnswerPayload to verify shared context and section-specific instructions.
    • Verified decideSynthesisMode correctly switches to sectional mode when the input budget is exceeded.
    • Added tests for selectMappedEntriesWithinBudget to confirm proper handling of oversized fragments.
    • Included a test for synthesizeSection to ensure sections are omitted when no fragments fit the budget.
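The runWithConcurrency helper listed in the changelog above is a bounded-concurrency runner for the per-section LLM calls. A sketch under assumed signature (the real helper may differ):

```typescript
// Run `worker` over `items` with at most `limit` calls in flight,
// preserving result order by index.
async function runWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length)
  let next = 0
  async function lane(): Promise<void> {
    // Each lane pulls the next unclaimed index when it finishes its item.
    while (next < items.length) {
      const index = next++
      results[index] = await worker(items[index])
    }
  }
  const lanes = Array.from(
    { length: Math.min(limit, items.length) },
    () => lane(),
  )
  await Promise.all(lanes)
  return results
}
```

Because JavaScript is single-threaded, the `next++` claim is safe: no await sits between reading and incrementing the shared index.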

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist bot left a comment

Code Review

This pull request introduces a new, sophisticated final answer synthesis mechanism, primarily implemented in the new file server/api/chat/final-answer-synthesis.ts. The module intelligently decides between a single-shot or a multi-sectional approach for generating answers, especially when dealing with large amounts of context and images, in order to stay within model token limits. It includes logic for planning sections, mapping fragments to those sections, and synthesizing each section concurrently.

To support this, server/ai/modelConfig.ts was updated to store and retrieve maximum input token limits for various models, and server/shared/types.ts was modified to include maxInputTokens in the ModelConfiguration interface. The existing server/api/chat/message-agents.ts file was refactored to delegate the entire final synthesis process to the new module, simplifying its code.

Additionally, a new test file, server/tests/finalAnswerSynthesis.test.ts, was added to validate the new synthesis logic, covering fragment preview generation, sectional payload construction, mode selection, and fragment budget management. A minor indentation issue was noted in the selectMappedEntriesWithinBudget function call within the new synthesis file.

Note: Security Review did not run due to the size of the PR.

Comment on lines +1308 to +1312
selectMappedEntriesWithinBudget(
orderedEntries,
baseTokens,
safeInputBudget,
)

medium

There's a minor indentation issue here that affects readability. The arguments to selectMappedEntriesWithinBudget are indented incorrectly.

Suggested change
selectMappedEntriesWithinBudget(
orderedEntries,
baseTokens,
safeInputBudget,
)
selectMappedEntriesWithinBudget(
orderedEntries,
baseTokens,
safeInputBudget,
)

coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/api/chat/message-agents.ts (1)

3092-3118: ⚠️ Potential issue | 🟠 Major

Reset the full final-synthesis state on every non-success exit.

review.lockedByFinalSynthesis is set before the streaming-channel guard and never cleared on the early return or either catch path. After a recoverable synthesis failure, the run will skip all later reviews/replanning, and the leftover streamedText / ackReceived can leak partial output into a retry or delegated fallback.

Proposed fix
     async execute(_args, context) {
       const mutableContext = mutableAgentContext(context)
-      if (!mutableContext.review.lockedByFinalSynthesis) {
-        mutableContext.review.lockedByFinalSynthesis = true
-        mutableContext.review.lockedAtTurn =
-          mutableContext.turnCount ?? MIN_TURN_NUMBER
-        loggerWithChild({ email: context.user.email }).info(
-          {
-            chatId: context.chat.externalId,
-            turn: mutableContext.review.lockedAtTurn,
-          },
-          "[MessageAgents][FinalSynthesis] Review lock activated after synthesis tool call."
-        )
-      }
+      const resetFinalSynthesisState = () => {
+        mutableContext.finalSynthesis.requested = false
+        mutableContext.finalSynthesis.completed = false
+        mutableContext.finalSynthesis.suppressAssistantStreaming = false
+        mutableContext.finalSynthesis.streamedText = ""
+        mutableContext.finalSynthesis.ackReceived = false
+        mutableContext.review.lockedByFinalSynthesis = false
+        mutableContext.review.lockedAtTurn = null
+      }
+
       if (
         mutableContext.finalSynthesis.requested &&
         mutableContext.finalSynthesis.completed
       ) {
         return ToolResponse.error(
@@
       if (!mutableContext.runtime?.streamAnswerText) {
         return ToolResponse.error(
           "EXECUTION_FAILED",
           "Streaming channel unavailable. Cannot deliver final answer."
         )
       }
+
+      if (!mutableContext.review.lockedByFinalSynthesis) {
+        mutableContext.review.lockedByFinalSynthesis = true
+        mutableContext.review.lockedAtTurn =
+          mutableContext.turnCount ?? MIN_TURN_NUMBER
+        loggerWithChild({ email: context.user.email }).info(
+          {
+            chatId: context.chat.externalId,
+            turn: mutableContext.review.lockedAtTurn,
+          },
+          "[MessageAgents][FinalSynthesis] Review lock activated after synthesis tool call."
+        )
+      }
@@
       } catch (error) {
+        resetFinalSynthesisState()
         if (isMessageAgentStopError(error)) {
-          context.finalSynthesis.suppressAssistantStreaming = false
-          context.finalSynthesis.requested = false
-          context.finalSynthesis.completed = false
           throw error
         }
-
-        context.finalSynthesis.suppressAssistantStreaming = false
-        context.finalSynthesis.requested = false
-        context.finalSynthesis.completed = false
         loggerWithChild({ email: context.user.email }).error(
           { err: error instanceof Error ? error.message : String(error) },
           "Final synthesis tool failed."
         )

Also applies to: 3160-3179

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/api/chat/message-agents.ts` around lines 3092 - 3118, The
final-synthesis state (e.g., mutableContext.review.lockedByFinalSynthesis,
mutableContext.review.lockedAtTurn and all fields under
mutableContext.finalSynthesis such as streamedText and ackReceived) is set
before the streaming-channel guard but not cleared on early returns or failure
paths; update the control flow so that any non-success exit (the early return
when runtime?.streamAnswerText is false, the "Final synthesis already completed"
path, and all catch/failure branches around the synthesis call) resets the
entire final-synthesis state to its initial/empty values and clears the review
lock (unset lockedByFinalSynthesis and lockedAtTurn) to avoid leaking partial
output into retries or later stages, ensuring you touch the same symbols
(mutableContext.review.lockedByFinalSynthesis,
mutableContext.review.lockedAtTurn, mutableContext.finalSynthesis.streamedText,
mutableContext.finalSynthesis.ackReceived, and
mutableContext.finalSynthesis.requested/completed).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0bc99031-2b58-42ec-bc34-f430af9bf1ad

📥 Commits

Reviewing files that changed from the base of the PR and between f16ea5e and 9199f39.

📒 Files selected for processing (6)
  • server/ai/modelConfig.ts
  • server/api/chat/final-answer-synthesis.ts
  • server/api/chat/message-agents.ts
  • server/logger/index.ts
  • server/shared/types.ts
  • server/tests/finalAnswerSynthesis.test.ts

coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (5)
server/ai/provider/base.ts (1)

23-33: Redundant fallback chain for modelId.

Line 25 already ensures actualModelId falls back to resolvedModelId (which itself falls back to defaultFastModel), but line 33 adds another || defaultFastModel fallback. This is unreachable.

🧹 Simplify redundant fallback
-    const actualModelId = modelConfig?.actualName || resolvedModelId || defaultFastModel
+    const actualModelId = modelConfig?.actualName || resolvedModelId
     const requestedMaxTokens = params.max_new_tokens || 1024 * 8
     return {
       maxTokens:
         getEffectiveMaxOutputTokens(resolvedModelId, requestedMaxTokens) ??
         requestedMaxTokens,
       topP: params.top_p || 0.9,
       temperature: params.temperature || 0.6,
-      modelId: actualModelId || defaultFastModel,
+      modelId: actualModelId,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/ai/provider/base.ts` around lines 23 - 33, The returned modelId has an
unreachable fallback: `actualModelId` is already computed as
`modelConfig?.actualName || resolvedModelId || defaultFastModel`, so remove the
redundant `|| defaultFastModel` in the return object; update the return's
`modelId` to just use `actualModelId` (ensuring `resolvedModelId`,
`modelConfig`, `actualModelId`, and `defaultFastModel` remain unchanged).
server/tests/fetchModels.test.ts (1)

94-111: Consider adding test for missing model metadata entirely.

The test covers the case where the model exists but has null token limits. Consider adding a test for when getModelTokenLimits is called with a model ID that doesn't exist in the cache at all to verify the warning behavior.

🧪 Optional: Add test for completely missing model
test("warns and returns defaults for model not in cache", () => {
  __modelInfoInternals.setModelInfoCacheForTests([])

  const limits = getModelTokenLimits("nonexistent-model")
  expect(limits).toEqual({
    maxInputTokens: 128_000,
    maxOutputTokens: undefined,
  })
  // Optionally verify warning was logged if you have a way to capture logs
})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/tests/fetchModels.test.ts` around lines 94 - 111, Add a test that
verifies behavior when the model ID is not present in the cache: call
__modelInfoInternals.setModelInfoCacheForTests([]) then call getModelTokenLimits
with a nonexistent model ID and assert it returns the generic defaults
(maxInputTokens 128_000 and maxOutputTokens undefined); also optionally call
getEffectiveMaxOutputTokens with a sample requested output (e.g., 1500) to
assert it preserves the requested output and, if feasible, assert a warning was
emitted.
server/ai/fetchModels.ts (3)

456-468: Silent clamping of requestedMaxTokens without caller notification.

getEffectiveMaxOutputTokens silently clamps the requested value to maxOutputTokens using Math.min(). Callers passing values larger than the model's upstream maximum (e.g., 8192 when max is 4096) won't know their request was reduced.

Consider logging when clamping occurs or returning additional metadata.

📝 Optional: Add debug logging when clamping
 export const getEffectiveMaxOutputTokens = (
   modelId: Models | string | null | undefined,
   requestedMaxTokens?: number,
 ): number | undefined => {
   if (requestedMaxTokens === undefined) {
     return undefined
   }

   const { maxOutputTokens } = getModelTokenLimits(modelId)
-  return maxOutputTokens !== undefined
-    ? Math.min(requestedMaxTokens, maxOutputTokens)
-    : requestedMaxTokens
+  if (maxOutputTokens !== undefined && requestedMaxTokens > maxOutputTokens) {
+    Logger.debug(
+      { modelId, requestedMaxTokens, maxOutputTokens },
+      "[ModelInfo] Clamped requested max tokens to model limit."
+    )
+    return maxOutputTokens
+  }
+  return requestedMaxTokens
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/ai/fetchModels.ts` around lines 456 - 468, getEffectiveMaxOutputTokens
silently reduces requestedMaxTokens to the model's maxOutputTokens (from
getModelTokenLimits) which can surprise callers; update
getEffectiveMaxOutputTokens to notify when clamping occurs by either (a)
emitting a debug/process logger message that includes the original
requestedMaxTokens, the modelId, and the applied maxOutputTokens when Math.min
reduces the value, or (b) change the function signature to return metadata
(e.g., { effectiveMaxTokens, wasClamped, allowedMax }) so callers can react;
locate the logic in getEffectiveMaxOutputTokens and the getModelTokenLimits call
and implement one of these notification approaches consistently across usages.

72-74: Module-level mutable state without synchronization.

The modelInfoCache and warnedMissingTokenLimitKeys are module-level mutable variables. While JavaScript/TypeScript is single-threaded, concurrent async operations (e.g., multiple simultaneous fetchModelInfoFromAPI calls) could lead to race conditions where one call overwrites the cache while another is reading it.

Consider using a mutex pattern or ensuring only one fetch is in-flight at a time.

♻️ Proposed pattern to deduplicate in-flight requests
 let modelInfoCache: ModelInfoCache | null = null
+let inFlightFetch: Promise<RawModelInfoRecord[]> | null = null

 const warnedMissingTokenLimitKeys = new Set<string>()

Then in fetchModelInfoFromAPI:

 export async function fetchModelInfoFromAPI(
   forceRefresh = false,
 ): Promise<RawModelInfoRecord[]> {
   if (!forceRefresh && modelInfoCache) {
     const age = Date.now() - modelInfoCache.timestamp
     if (age < CACHE_TTL_MS) {
       return modelInfoCache.rawData
     }
   }

+  if (inFlightFetch) {
+    return inFlightFetch
+  }
+
+  inFlightFetch = (async () => {
+    try {
       // ... existing fetch logic ...
+    } finally {
+      inFlightFetch = null
+    }
+  })()
+
+  return inFlightFetch
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/ai/fetchModels.ts` around lines 72 - 74, modelInfoCache and
warnedMissingTokenLimitKeys are module-level mutable state that can race when
multiple fetchModelInfoFromAPI calls run concurrently; change the implementation
to deduplicate and synchronize in-flight fetches by storing an in-flight Promise
(e.g., inFlightModelInfoPromise) keyed by model id or a single mutex-like
Promise, so fetchModelInfoFromAPI checks for an existing in-flight Promise and
awaits it instead of starting a new network call, only updating modelInfoCache
and warnedMissingTokenLimitKeys once the Promise resolves; ensure any rejection
clears the in-flight entry so subsequent calls can retry and keep references to
the unique symbols modelInfoCache, warnedMissingTokenLimitKeys, and
fetchModelInfoFromAPI when locating the code to modify.

733-745: Test internals exported unconditionally.

The __modelInfoInternals object exposes internal functions and cache mutation helpers in production builds. While useful for testing, this could allow unintended cache manipulation.

🧪 Optional: Guard test exports with environment check
-export const __modelInfoInternals = {
+export const __modelInfoInternals = process.env.NODE_ENV === "test" ? {
   getCachedNormalizedModelMetadata,
   normalizeModelInfoRecords,
   resolveModelMetadata,
   resetModelInfoCacheForTests: () => {
     modelInfoCache = null
     warnedMissingTokenLimitKeys.clear()
   },
   setModelInfoCacheForTests: (records: RawModelInfoRecord[]) => {
     warnedMissingTokenLimitKeys.clear()
     updateModelInfoCache(records)
   },
-}
+} : {} as typeof __modelInfoInternals
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/ai/fetchModels.ts` around lines 733 - 745, The __modelInfoInternals
export currently exposes internal helpers (getCachedNormalizedModelMetadata,
normalizeModelInfoRecords, resolveModelMetadata) and test-only cache mutators
(resetModelInfoCacheForTests, setModelInfoCacheForTests) unconditionally;
restrict this by only assigning/exporting __modelInfoInternals when running in a
test environment (e.g., check process.env.NODE_ENV === 'test' or an isTestEnv
flag) so production builds do not expose modelInfoCache or
warnedMissingTokenLimitKeys mutation or the updateModelInfoCache helper; keep
the same object shape for tests but ensure the conditional guards the export
assignment.
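A hypothetical factory-style variant of the same guard, kept free of the PR's real symbols so the idea is testable in isolation:

```typescript
// Hypothetical sketch: expose cache mutators only under NODE_ENV=test, so
// production builds get `undefined` instead of a live mutation surface.
function buildTestInternals(
  nodeEnv: string | undefined,
  cache: { records: string[] | null },
) {
  if (nodeEnv !== "test") return undefined
  return {
    resetCacheForTests: () => {
      cache.records = null
    },
    setCacheForTests: (records: string[]) => {
      cache.records = [...records]
    },
  }
}
```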
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/ai/fetchModels.ts`:
- Around line 338-343: The current fallback path returns modelInfoCache.rawData
when config.LiteLLMApiKey is falsy without re-checking freshness; update the
logic in fetchModels.ts so that when config.LiteLLMApiKey is missing you still
compute age = Date.now() - modelInfoCache.timestamp and only return
modelInfoCache.rawData if age < CACHE_TTL_MS; if the cache is stale, do not
return stale data and instead log a clear warning (including timestamp/age and
that the API key is missing) before proceeding to the error/empty response
path—apply this change where forceRefresh, modelInfoCache, CACHE_TTL_MS and
config.LiteLLMApiKey are used and replace the unconditional return of
modelInfoCache.rawData with the freshness-checked behavior and logging.

In `@server/api/chat/final-answer-synthesis.ts`:
- Around line 281-297: The safeInputBudget calculation in
estimateSafeInputBudget currently forces a minimum of 1024 which can exceed the
actual remaining prompt capacity; replace the Math.max(1_024, maxInputTokens -
reservedOutputTokens) logic so safeInputBudget is clamped to the actual
remaining budget (maxInputTokens - reservedOutputTokens) and not
inflated—compute remaining = maxInputTokens - reservedOutputTokens, then set
safeInputBudget = Math.max(1, remaining) (or Math.max(0, remaining) if you
prefer allowing zero) to ensure it never exceeds the real remaining capacity.
- Around line 757-765: The single-section fallback created by
buildDefaultSectionPlan currently uses section-only instructions that suppress
producing a full final answer; change buildDefaultSectionPlan’s returned section
objective/title to explicitly instruct producing the complete final answer when
it’s the only section (e.g., “Produce the full final answer using mapped
evidence”). Also update the prompt/template logic that emits the “Do not write
the full final answer” instruction (the code that composes section-level
instructions where the string "Do not write the full final answer" appears) to
conditionally skip that prohibition when the plan contains exactly one section
so the fallback yields a complete answer.

---

Nitpick comments:
In `@server/ai/fetchModels.ts`:
- Around line 456-468: getEffectiveMaxOutputTokens silently reduces
requestedMaxTokens to the model's maxOutputTokens (from getModelTokenLimits)
which can surprise callers; update getEffectiveMaxOutputTokens to notify when
clamping occurs by either (a) emitting a debug/process logger message that
includes the original requestedMaxTokens, the modelId, and the applied
maxOutputTokens when Math.min reduces the value, or (b) change the function
signature to return metadata (e.g., { effectiveMaxTokens, wasClamped, allowedMax
}) so callers can react; locate the logic in getEffectiveMaxOutputTokens and the
getModelTokenLimits call and implement one of these notification approaches
consistently across usages.
- Around line 72-74: modelInfoCache and warnedMissingTokenLimitKeys are
module-level mutable state that can race when multiple fetchModelInfoFromAPI
calls run concurrently; change the implementation to deduplicate and synchronize
in-flight fetches by storing an in-flight Promise (e.g.,
inFlightModelInfoPromise) keyed by model id or a single mutex-like Promise, so
fetchModelInfoFromAPI checks for an existing in-flight Promise and awaits it
instead of starting a new network call, only updating modelInfoCache and
warnedMissingTokenLimitKeys once the Promise resolves; ensure any rejection
clears the in-flight entry so subsequent calls can retry and keep references to
the unique symbols modelInfoCache, warnedMissingTokenLimitKeys, and
fetchModelInfoFromAPI when locating the code to modify.
- Around line 733-745: The __modelInfoInternals export currently exposes
internal helpers (getCachedNormalizedModelMetadata, normalizeModelInfoRecords,
resolveModelMetadata) and test-only cache mutators (resetModelInfoCacheForTests,
setModelInfoCacheForTests) unconditionally; restrict this by only
assigning/exporting __modelInfoInternals when running in a test environment
(e.g., check process.env.NODE_ENV === 'test' or an isTestEnv flag) so production
builds do not expose modelInfoCache or warnedMissingTokenLimitKeys mutation or
the updateModelInfoCache helper; keep the same object shape for tests but ensure
the conditional guards the export assignment.

In `@server/ai/provider/base.ts`:
- Around line 23-33: The returned modelId has an unreachable fallback:
`actualModelId` is already computed as `modelConfig?.actualName ||
resolvedModelId || defaultFastModel`, so remove the redundant `||
defaultFastModel` in the return object; update the return's `modelId` to just
use `actualModelId` (ensuring `resolvedModelId`, `modelConfig`, `actualModelId`,
and `defaultFastModel` remain unchanged).

In `@server/tests/fetchModels.test.ts`:
- Around line 94-111: Add a test that verifies behavior when the model ID is not
present in the cache: call __modelInfoInternals.setModelInfoCacheForTests([])
then call getModelTokenLimits with a nonexistent model ID and assert it returns
the generic defaults (maxInputTokens 128_000 and maxOutputTokens undefined);
also optionally call getEffectiveMaxOutputTokens with a sample requested output
(e.g., 1500) to assert it preserves the requested output and, if feasible,
assert a warning was emitted.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: db99e037-ea90-4766-96fd-cee55546c6fd

📥 Commits

Reviewing files that changed from the base of the PR and between 9199f39 and 03e9341.

📒 Files selected for processing (7)
  • server/ai/fetchModels.ts
  • server/ai/provider/base.ts
  • server/api/chat/final-answer-synthesis.ts
  • server/api/workflow.ts
  • server/shared/types.ts
  • server/tests/fetchModels.test.ts
  • server/tests/finalAnswerSynthesis.test.ts

Comment on lines 338 to 343
  if (!forceRefresh && modelInfoCache) {
    const age = Date.now() - modelInfoCache.timestamp
    if (age < CACHE_TTL_MS) {
-     return modelInfoCache.data
+     return modelInfoCache.rawData
    }
  }

⚠️ Potential issue | 🟡 Minor

Cache freshness check passes stale data when API key is missing.

When config.LiteLLMApiKey is falsy, the function returns cached raw data (line 347) without checking whether that cache is stale. This is inconsistent with the freshness check at lines 338-343.

If the API key becomes unavailable after initial caching, stale data could be served indefinitely.

🔧 Proposed fix to log staleness when falling back without API key
   if (!config.LiteLLMApiKey) {
-    Logger.warn("LiteLLM API key not configured, returning empty array")
+    if (modelInfoCache) {
+      const age = Date.now() - modelInfoCache.timestamp
+      Logger.warn(
+        { cacheAgeMs: age },
+        "LiteLLM API key not configured, returning cached data"
+      )
+    } else {
+      Logger.warn("LiteLLM API key not configured, returning empty array")
+    }
     return modelInfoCache?.rawData ?? []
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/ai/fetchModels.ts` around lines 338 - 343, The current fallback path
returns modelInfoCache.rawData when config.LiteLLMApiKey is falsy without
re-checking freshness; update the logic in fetchModels.ts so that when
config.LiteLLMApiKey is missing you still compute age = Date.now() -
modelInfoCache.timestamp and only return modelInfoCache.rawData if age <
CACHE_TTL_MS; if the cache is stale, do not return stale data and instead log a
clear warning (including timestamp/age and that the API key is missing) before
proceeding to the error/empty response path—apply this change where
forceRefresh, modelInfoCache, CACHE_TTL_MS and config.LiteLLMApiKey are used and
replace the unconditional return of modelInfoCache.rawData with the
freshness-checked behavior and logging.
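The freshness-checked fallback can be reduced to one small helper. The TTL value and names below are assumptions for illustration, not the PR's code:

```typescript
// Cached data is only served while younger than the TTL, even on the
// no-API-key fallback path; stale or missing caches return null.
const CACHE_TTL_MS = 60_000

interface TimedCache<T> {
  timestamp: number
  rawData: T
}

function readCacheIfFresh<T>(
  cache: TimedCache<T> | null,
  now: number = Date.now(),
): T | null {
  if (!cache) return null
  const age = now - cache.timestamp
  return age < CACHE_TTL_MS ? cache.rawData : null
}
```

Threading `now` through as a parameter also makes the staleness boundary trivially unit-testable without clock mocking.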

Comment on lines +281 to +297
function estimateSafeInputBudget(
  modelId: Models,
  maxOutputTokens?: number,
): { maxInputTokens: number; safeInputBudget: number } {
  const { maxInputTokens } = getModelTokenLimits(modelId)
  const effectiveMaxOutputTokens =
    getEffectiveMaxOutputTokens(
      modelId,
      maxOutputTokens ?? FALLBACK_OUTPUT_TOKENS,
    ) ??
    maxOutputTokens ??
    FALLBACK_OUTPUT_TOKENS
  const reservedOutputTokens = Math.ceil(
    effectiveMaxOutputTokens * (1 + FINAL_OUTPUT_HEADROOM_RATIO),
  )
  const safeInputBudget = Math.max(1_024, maxInputTokens - reservedOutputTokens)
  return { maxInputTokens, safeInputBudget }

⚠️ Potential issue | 🟠 Major

Clamp safeInputBudget to the actual remaining budget.

Line 296 can inflate safeInputBudget above the real remaining capacity. For example, if a model only has a few hundred prompt tokens left after reserving output, this still forces the budget up to 1024, so decideSynthesisMode() can keep a payload that the provider will reject on context length.

Proposed fix
 function estimateSafeInputBudget(
   modelId: Models,
   maxOutputTokens?: number,
 ): { maxInputTokens: number; safeInputBudget: number } {
   const { maxInputTokens } = getModelTokenLimits(modelId)
   const effectiveMaxOutputTokens =
     getEffectiveMaxOutputTokens(
       modelId,
       maxOutputTokens ?? FALLBACK_OUTPUT_TOKENS,
     ) ??
     maxOutputTokens ??
     FALLBACK_OUTPUT_TOKENS
   const reservedOutputTokens = Math.ceil(
     effectiveMaxOutputTokens * (1 + FINAL_OUTPUT_HEADROOM_RATIO),
   )
-  const safeInputBudget = Math.max(1_024, maxInputTokens - reservedOutputTokens)
+  const remainingInputBudget = maxInputTokens - reservedOutputTokens
+  const safeInputBudget = Math.max(1, remainingInputBudget)
   return { maxInputTokens, safeInputBudget }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/api/chat/final-answer-synthesis.ts` around lines 281 - 297, The
safeInputBudget calculation in estimateSafeInputBudget currently forces a
minimum of 1024 which can exceed the actual remaining prompt capacity; replace
the Math.max(1_024, maxInputTokens - reservedOutputTokens) logic so
safeInputBudget is clamped to the actual remaining budget (maxInputTokens -
reservedOutputTokens) and not inflated—compute remaining = maxInputTokens -
reservedOutputTokens, then set safeInputBudget = Math.max(1, remaining) (or
Math.max(0, remaining) if you prefer allowing zero) to ensure it never exceeds
the real remaining capacity.
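Worked numbers make the overshoot concrete. The headroom ratio of 0.25 below is chosen for exact arithmetic and is an assumption; the PR's actual ratio may differ:

```typescript
// With a 4096-token window and 3000 output tokens reserved at 25% headroom,
// only 346 input tokens remain, yet the old floor still reports 1024.
const HEADROOM_RATIO = 0.25

function oldSafeInputBudget(maxInputTokens: number, maxOutputTokens: number): number {
  const reserved = Math.ceil(maxOutputTokens * (1 + HEADROOM_RATIO))
  return Math.max(1_024, maxInputTokens - reserved) // floor can exceed what's left
}

function clampedSafeInputBudget(maxInputTokens: number, maxOutputTokens: number): number {
  const reserved = Math.ceil(maxOutputTokens * (1 + HEADROOM_RATIO))
  return Math.max(1, maxInputTokens - reserved) // never exceeds real capacity
}
```

For large context windows the two functions agree; only near-exhausted windows expose the difference, which is exactly where the provider rejection would occur.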

Comment on lines +757 to +765
function buildDefaultSectionPlan(): FinalSection[] {
  return [
    {
      sectionId: 1,
      title: "Answer",
      objective: "Provide the best complete answer using the mapped evidence.",
    },
  ]
}

⚠️ Potential issue | 🟠 Major

Don’t use section-only instructions for the single-section fallback.

Lines 757-765 create a one-section fallback, but Lines 879-899 still tell the model “Do not write the full final answer” and that other sections are being generated in parallel. That makes the recovery path intentionally suppress a complete answer exactly when planning/mapping has already failed.

Proposed fix
 function buildSectionAnswerPayload(
   context: AgentRunContext,
   sections: FinalSection[],
   section: FinalSection,
   entries: FragmentAssignmentBatch[],
   imageFileNames: string[],
 ): { systemPrompt: string; userMessage: string; imageFileNames: string[] } {
+  const singleSection = sections.length === 1
   const sharedContext = buildSharedFinalAnswerContext(context)
   const sectionOverview = formatSectionPlanOverview(sections)
   const fragmentsText = formatSectionFragments(entries)
   const fragmentsSection = fragmentsText
     ? `Context Fragments For This Section:\n${fragmentsText}`
     : "Context Fragments For This Section:\nNone."
 
   const userMessage = [
     sharedContext,
-    `All Planned Sections (generated in parallel; a final ordered answer will be assembled later):\n${sectionOverview}`,
+    singleSection
+      ? ""
+      : `All Planned Sections (generated in parallel; a final ordered answer will be assembled later):\n${sectionOverview}`,
     `Assigned Section:\n${section.sectionId}. ${section.title}\nObjective: ${section.objective}`,
     [
       "Section Instructions:",
       "- Write only this section.",
-      "- Do not write the full final answer.",
-      "- Other sections are being generated in parallel.",
-      "- A final ordered answer will be assembled afterwards.",
+      ...(singleSection
+        ? ["- Deliver the complete final answer for this section."]
+        : [
+            "- Do not write the full final answer.",
+            "- Other sections are being generated in parallel.",
+            "- A final ordered answer will be assembled afterwards.",
+          ]),
       "- Avoid intro or outro language that assumes the whole answer is already visible.",
       "- Use the provided global fragment indexes exactly as shown for citations.",
     ].join("\n"),
     fragmentsSection,
   ]
     .filter(Boolean)
     .join("\n\n")
 
   return {
-    systemPrompt: buildBaseFinalAnswerSystemPrompt("section"),
+    systemPrompt: buildBaseFinalAnswerSystemPrompt(singleSection ? "final" : "section"),
     userMessage,
     imageFileNames,
   }
 }

Also applies to: 879-899

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/api/chat/final-answer-synthesis.ts` around lines 757 - 765, The
single-section fallback created by buildDefaultSectionPlan currently uses
section-only instructions that suppress producing a full final answer; change
buildDefaultSectionPlan’s returned section objective/title to explicitly
instruct producing the complete final answer when it’s the only section (e.g.,
“Produce the full final answer using mapped evidence”). Also update the
prompt/template logic that emits the “Do not write the full final answer”
instruction (the code that composes section-level instructions where the string
"Do not write the full final answer" appears) to conditionally skip that
prohibition when the plan contains exactly one section so the fallback yields a
complete answer.
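The conditional instruction logic suggested above can be isolated into one small function. The name and wording below are illustrative, not the PR's actual strings:

```typescript
// The "do not write the full final answer" prohibition only applies when
// the plan genuinely has multiple sections; a single-section fallback is
// told to deliver the complete answer instead.
function buildSectionInstructions(sectionCount: number): string[] {
  const shared = [
    "Use the provided global fragment indexes exactly as shown for citations.",
  ]
  if (sectionCount === 1) {
    return ["Deliver the complete final answer for this section.", ...shared]
  }
  return [
    "Write only this section.",
    "Do not write the full final answer.",
    "Other sections are being generated in parallel.",
    ...shared,
  ]
}
```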

